Project - Feature Selection, Model Selection and Tuning

Problem Statement: Concrete Strength Prediction

Objective

Steps and Tasks:

Feature Engineering techniques (10 marks):
3.1 Identify opportunities (if any) to extract new features from existing features, or to drop a feature (if required). Hint: Feature Extraction. For example, consider a dataset with two features, length and breadth. From these we can extract a new feature, Area, which would be length * breadth.
3.2 Get the data ready for modelling and do a train-test split.
3.3 Decide on the complexity of the model: should it be a simple linear model in terms of parameters, or would a quadratic or higher-degree model be better?
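Following the length * breadth hint, one candidate derived feature for this dataset is the water-to-cement ratio, a quantity known to influence concrete strength. This is only an illustrative sketch (not part of the graded work); the column names assume the concrete.csv schema used below, and the two rows are made-up values.

```python
import pandas as pd

# Hypothetical feature extraction: water-to-cement ratio from two
# existing columns (cement and water are in kg per m3 of mixture).
demo = pd.DataFrame({'cement': [540.0, 332.5], 'water': [162.0, 228.0]})
demo['water_cement_ratio'] = demo['water'] / demo['cement']
print(demo)
```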
Attribute Information: Given are the variable name, variable type, the measurement unit, and a brief description. Concrete compressive strength is the regression target. The order of this listing corresponds to the order of the numerals along the rows of the database.
Input variables: Name -- Data Type -- Measurement -- Description
Cement -- quantitative -- kg in a m3 mixture -- Input Variable
Blast Furnace Slag -- quantitative -- kg in a m3 mixture -- Input Variable
Fly Ash -- quantitative -- kg in a m3 mixture -- Input Variable
Water -- quantitative -- kg in a m3 mixture -- Input Variable
Superplasticizer -- quantitative -- kg in a m3 mixture -- Input Variable
Coarse Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
Fine Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable

Output variable (desired target):
Concrete compressive strength -- quantitative -- MPa -- Output Variable

import warnings
warnings.filterwarnings('ignore')
import pandas as pd #Read files
import numpy as np # numerical libraries
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import Image
from os import system
pd.options.display.float_format = '{:,.4f}'.format
# Below we will read the data from the local folder
df = pd.read_csv("concrete.csv")
# Now display the header
print ('concrete.csv data set:')
df.head(10)
df.tail() ## to see what the end of the data looks like
df.info() # here we see the number of entries (rows and columns), dtype, and non-null count
print(f"The given dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The given dataset contains {df.isna().sum().sum()} null values")
df.shape # size of the data set also shown in the cell above
neg_count = (df < 0).sum().sum() # count the negative values present
print("The number of negative entries is", neg_count)
# the output might be taken into consideration later on in the calculations.
df.describe().transpose() # Transpose is used here to make the attributes easier to read
df.nunique() # Number of unique values in a column
# this helps to identify categorical values.
# Now we will get a list of unique values to evaluate how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    # if number of unique values is less than 30, print the values. Otherwise print the number of unique values
    if len(n) < 30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' + str(len(n)) + ' unique values')
        print()
plt.subplots(figsize=(20, 20))
ax = sns.boxplot(data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
As expected (Insight 2), age shows the largest amount of outliers, while other columns (slag, water, superplastic and fineagg) show fewer. These boxplots will be evaluated with histograms below.
df.hist(stacked=False, bins=50, figsize=(30,30), layout=(3,3)); # Histograms of the tentative continuous variables
# ***Please note that some of these variables may be dropped after the graphical evaluation.
## Please note: I found this code on the internet; it allows a better and faster visualization of multiple distribution plots,
## compared with how I did it in the previous project and with the line above.
#### With this code we can also see the mean of each variable, which is better than the histograms plotted above.
import itertools
import statistics
cols = list(df.columns)
fig = plt.figure(figsize=(20, 25))
for i, j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(5, 2, j + 1)
    ax = sns.distplot(df[i], color='blue', rug=True)
    plt.axvline(df[i].mean(), linestyle="dashed", label="Mean", color='black')
    plt.axvline(statistics.mode(df[i]), linestyle="dashed", label="Mode", color='red')
    plt.axvline(statistics.median(df[i]), linestyle="dashed", label="Median", color='green')
    plt.legend()
    plt.title(i)
    plt.xlabel("")
df.corr() # with this function we will try to see the correlations between variables numerically
strength has a higher correlation (>= 30%) with cement, superplastic and age, and ~29% with water (negative correlation); it has only a minor correlation (<1%) with the rest of the variables. I will try to test a model without slag and ash since they seem to have a low impact on the target variable. Some variable pairs show a good negative correlation of around -40%, and water and superplastic have a strong negative correlation of about -65%.
g = sns.PairGrid(df)
g.map_upper(plt.scatter)
g.map_lower(sns.lineplot)
g.map_diag(sns.kdeplot, lw=3, legend=True);
sns.pairplot(df , hue='age' , diag_kind = 'kde')
plt.show()
Here we see, on the diagonal, an almost normal distribution for cement, water, coarseagg, fineagg and strength. We also see a double Gaussian (explained in the lectures) plus skewness (Insight #5) in slag, ash, superplastic and age. We can also confirm graphically the observations in Insight #6 about the correlations between the target variable and the other variables; however, I found the numbers easier to read than these graphics. Here we also see different Gaussians on the diagonal for every variable when we split by age, hence it will be interesting to change the age variable into smaller groups and evaluate the results later on. In this pairplot the ash, superplastic and slag variables show the double Gaussian more clearly.
# Another correlation method
plt.figure(figsize=(10,10))
mask = np.zeros_like(df.corr('spearman'))
mask[np.triu_indices_from(mask)] = True
ax =sns.heatmap(df.corr(),
annot=True,
linewidths=.5,
center=0,
cmap="YlGnBu",
mask= mask,
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
plt.show()
With this plot it is easy to recognise the findings observed in Insight #6, where it took longer to find the correlations between the variables. In addition to Insight #6, here we also find correlations between other variables, positive and negative but below 30%; this plot can be reviewed later on to determine the relevance of some variables. It is confirmed that the 3 most important variables for the target are cement, superplastic and age.
# pip install pandas-profiling[notebook]
from pandas_profiling import ProfileReport
profile = ProfileReport(df)
profile
The main takeaway of this step is that for the next projects I will apply it at the beginning of the study. It gives an idea of where to look for more detail, such as the relevant correlations and the most important variables. Most of the observations from this report were mentioned already.
X = df.drop('strength', axis=1) # Separating the target from the rest
Y = df['strength']
#from sklearn.model_selection import train_test_split # Splitting the data for training and testing out model
##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)
x_train.head() # this is to review the columns
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
x_train.dtypes
from sklearn.linear_model import LinearRegression, LogisticRegression,Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from yellowbrick.classifier import ClassificationReport, ROCAUC
from sklearn.svm import SVR
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#Linear Regression
linR = LinearRegression()
linR.fit(x_train, y_train)
pred = linR.predict(x_test) # Predictions from linear regression
score0_train = linR.score(x_train, y_train)
score0_test = linR.score(x_test, y_test)
#Linear Regression with Polynomial features of degree 2
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
pipeline.fit(x_train,y_train)
score1_train = pipeline.score(x_train,y_train) # Score of linear regression degree 2 on the training set
score1_test= pipeline.score(x_test,y_test)
print('Score Linear Regression degree 2_train', score1_train)
print('Score Linear Regression degree 2_test', score1_test)
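As a sanity check on what the 'poly' step in the pipeline does (a sketch, not part of the original notebook): for two inputs a and b, degree=2 appends the bias term, the squares, and the cross term, so the 8 concrete features become 45 columns before the linear fit.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# degree=2 expands [a, b] into [1, a, b, a^2, a*b, b^2]
X_demo = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(X_demo)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```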
#Linear Regression with Polynomial features of degree 3
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
pipeline.fit(x_train,y_train)
score2_train = pipeline.score(x_train,y_train) # Score of linear regression degree 3 on the training set
score2_test= pipeline.score(x_test,y_test)
print('Score Linear Regression degree 3_train', score2_train)
print('Score Linear Regression degree 3_test', score2_test)
#dt = DecisionTree Regressor
dt = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
dt.fit(x_train, y_train)
score3_train= dt.score(x_train, y_train)
score3_test= dt.score(x_test, y_test)
pred_dt = dt.predict(x_test)
#dt = DecisionTree Regressor degree 2
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
pipeline.fit(x_train,y_train)
score4_train = pipeline.score(x_train,y_train)
score4_test= pipeline.score(x_test,y_test)
print('Score Dt Regressor degree 2_train', score4_train)
print('Score Dt Regressor degree 2_test', score4_test)
rf = RandomForestRegressor(random_state=42, max_depth=4)
rf.fit(x_train, y_train)
score5_train = rf.score(x_train, y_train)
score5_test= rf.score(x_test, y_test)
ls=Lasso(random_state=42)
ls.fit(x_train, y_train)
pred_ls = ls.predict(x_test)
score6_train=ls.score(x_train, y_train)
score6_test= ls.score(x_test, y_test)
print (score0_train, score1_train, score2_train, score3_train, score4_train, score5_train, score6_train)
print (score0_test, score1_test, score2_test, score3_test, score4_test, score5_test, score6_test)
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed) # shuffle=True is required when a random_state is given
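A quick note on the choice num_folds = 50 (a sketch, not in the original notebook): with the ~1030 rows of this dataset, each validation fold holds only about 20 samples, which is one reason the fold-score standard deviation ends up fairly large. The splitter below uses shuffle=True, which is needed whenever a random_state is given.

```python
import numpy as np
from sklearn.model_selection import KFold

# 1030 rows is the size of the UCI concrete dataset used here
kf = KFold(n_splits=50, shuffle=True, random_state=7)
fold_sizes = [len(test) for _, test in kf.split(np.zeros(1030))]
print(min(fold_sizes), max(fold_sizes))  # 20 21
```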
#Linear Regression
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]
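The block above is repeated almost verbatim for every estimator below; a hypothetical helper (not in the original notebook) could build each results row from the same inputs. Note also that mean ± 1.96*std describes the spread of the 50 fold scores rather than a confidence interval for their mean, which would divide std by sqrt(50).

```python
import pandas as pd

# Hypothetical refactor: one function producing the per-model row used above.
# `mean` and `std` are the (already rounded) cross-validation statistics.
def score_row(name, train_score, test_score, mean, std):
    lo, hi = round(mean - 1.96 * std, 3), round(mean + 1.96 * std, 3)
    return pd.DataFrame({'Model': [name],
                         'score_training': round(train_score, 3),
                         'score_test': round(test_score, 3),
                         'k_fold_mean': mean,
                         'k_fold_std': std,
                         '95% confidence intervals': f'{lo} <-> {hi}'})

row = score_row('LinearRegression', 0.8123, 0.7891, 0.75, 0.1)
print(row)
```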
#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#dt = DecisionTree Regressor degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Lasso
model=Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
## The next lines are to select the best model based on the k_fold_mean and add it in the last row
tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1)
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = pd.concat([results, tmp_best], ignore_index=True) # DataFrame.append was removed in pandas 2.0
## This is the table with the scoring result of the algorithm with all the data
results
# 1st I will try to count the outliers in the variables, as seen in the previous mentored sessions.
# Age was showing the largest amount of outliers in the EDA
q1= df.quantile(0.25)
q3= df.quantile(0.75)
IQR = q3-q1
low = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range
outliers = pd.DataFrame(((df > (high)) | (df < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)
outliers
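The IQR rule applied above can be illustrated on a toy series (a sketch with made-up numbers; the 1.5 multiplier matches the notebook's acceptable range).

```python
import pandas as pd

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [50]
```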
q1 = df['age'].quantile(0.25) #first quartile value
q3 = df['age'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range
df_in = df.loc[(df['age'] >= low) & (df['age'] <= high)] # meeting the acceptable range
df_out = df.loc[(df['age'] < low) | (df['age'] > high)] # not meeting the acceptable range
age_mean=int(df_in.age.mean()) #finding the mean of the acceptable range
print('age_mean =', age_mean)
print('Shape of df without outliers =', df_in.shape)
print('# of outliers =', df_out.shape)
print('lower range =', low, '&', 'higher range =', high)
# imputing outlier values with the mean value (use .copy() to avoid a SettingWithCopyWarning)
df_out = df_out.copy()
df_out['age'] = age_mean
#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', df.shape) # original data frame
print('Shape of new df', df_rev.shape) # new df
## The code from above is repeated to check the outliers again
q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range
outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)
outliers
# Here I will replace the outliers of age by the "higher acceptable range", because due to using the mean the # of outliers increased from 59 to 131
q1 = df['age'].quantile(0.25) #first quartile value
q3 = df['age'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range
df_in = df.loc[(df['age'] >= low) & (df['age'] <= high)] # meeting the acceptable range
df_out = df.loc[(df['age'] < low) | (df['age'] > high)] # not meeting the acceptable range
print('Shape of df without outliers = ' ,df_in.shape)
print('# of outliers =', df_out.shape)
print('lower range =', low, '&', 'higher range =', high)
# imputing outlier values with the higher acceptable range (use .copy() to avoid a SettingWithCopyWarning)
df_out = df_out.copy()
df_out['age'] = high
#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', df.shape) # original data frame
print('Shape of new df', df_rev.shape) # new df
## checking the outliers again
q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range
outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)
outliers
q1 = df_rev['superplastic'].quantile(0.25) #first quartile value
q3 = df_rev['superplastic'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range
df_in = df_rev.loc[(df_rev['superplastic'] >= low) & (df_rev['superplastic'] <= high)] # meeting the acceptable range
df_out = df_rev.loc[(df_rev['superplastic'] < low) | (df_rev['superplastic'] > high)] # not meeting the acceptable range
superplastic_mean=int(df_in.superplastic.mean()) #finding the mean of the acceptable range
print('Superplastic_mean =' ,superplastic_mean)
print('Shape of df without outliers = ' ,df_in.shape)
print('# of outliers =', df_out.shape)
print('lower range =', low, '&', 'higher range =', high)
# imputing outlier values with the mean value (use .copy() to avoid a SettingWithCopyWarning)
df_out = df_out.copy()
df_out['superplastic'] = superplastic_mean
#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', df.shape) # original data frame
print('Shape of new df', df_rev.shape) # new df
## The code from above is repeated to check the outliers again
q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range
outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)
outliers
X = df_rev.drop('strength', axis=1) # Separating the target from the rest
Y = df_rev['strength']
##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)
x_train.head() # this is to review the columns
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed) # shuffle=True is required when a random_state is given
#Linear Regression
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]
#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#dt = DecisionTree Regressor degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Lasso
model=Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
## The next lines are to select the best model based on the k_fold_mean and add it in the last row
tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1)
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = pd.concat([results, tmp_best], ignore_index=True) # DataFrame.append was removed in pandas 2.0
## This is the table with the scoring result of the algorithm with all the data
results
# these are the variables ('ash', 'coarseagg', 'fineagg') with the least correlation observed during the EDA
# 'slag' could also be dropped and the model evaluated later on; in this project I will not try it due to lack of time
print (' Original columns: ',df_rev.columns)
df_rev2=df_rev.drop(['ash', 'coarseagg', 'fineagg'],axis=1)
print (' New columns: ', df_rev2.columns)
X = df_rev2.drop('strength', axis=1) # Separating the target from the rest
Y = df_rev2['strength']
##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)
x_train.head() # this is to review the columns
num_folds = 50
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed) # shuffle=True is required when a random_state is given
#Linear Regression
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]
#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Decision tree regressor with polynomial features of degree 2
model = Pipeline([('poly', PolynomialFeatures(degree=2)), ('reg', DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
#Lasso
model = Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 2
model = Pipeline([('poly', PolynomialFeatures(degree=2)), ('reg', Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Lasso with polynomial features of degree 3
model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('reg', Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'],
'score_training': round(model.score(x_train, y_train),3),
'score_test':round(model.score(x_test, y_test),3),
'k_fold_mean':mean,
'k_fold_std':std,
'95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
})
results = pd.concat([results, tempresults])
## Select the best model based on k_fold_mean and append it as the last row
tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1).copy()
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = pd.concat([results, tmp_best], ignore_index=True)  # DataFrame.append is deprecated in recent pandas
## Scoring results for all models on the full dataset
results
### Gradient boosting is the model that best predicts the strength of concrete, given this dataset and the models studied
### In the following steps I will tune the parameters of Gradient boosting
### Since dropping the variable did not improve the model (insight #12), we will use df_rev, the dataset obtained after correcting the outliers
X = df_rev.drop('strength', axis=1) # Separating the features from the target
Y = df_rev['strength']
## Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just an arbitrary random seed
print (x_train.shape, x_test.shape, y_train.shape, y_test.shape)
x_train.head() # this is to review the columns
from sklearn.model_selection import RandomizedSearchCV
# Prepare the parameter grid
# Please note: due to lack of time to complete the project, I took the parameter configuration below from the internet;
# I checked several sources and they used similar parameters with similar values.
# In real life I would need to learn more about this algorithm and the use of each parameter.
parameters = {
'criterion': ['friedman_mse', 'squared_error'],  # 'mse' and 'mae' are deprecated/removed in recent scikit-learn
'learning_rate': [0.05, 0.1, 0.15, 0.2],
'max_depth': [2, 3, 4, 5],
'max_features': ['sqrt', None],
'max_leaf_nodes': list(range(2, 10)),
'n_estimators': list(range(50, 500, 50)),
'subsample': [0.8, 0.9, 1.0]
}
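RandomizedSearchCV samples only `n_iter` parameter settings rather than trying every combination; for context, the size of the full grid can be computed with a product over the candidate lists (a sketch using the numeric lists above, omitting `criterion`):

```python
from math import prod

# Grid mirroring the numeric parameter lists defined above (criterion omitted)
grid = {
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'max_depth': [2, 3, 4, 5],
    'max_features': ['sqrt', None],
    'max_leaf_nodes': list(range(2, 10)),
    'n_estimators': list(range(50, 500, 50)),
    'subsample': [0.8, 0.9, 1.0],
}
n_combinations = prod(len(v) for v in grid.values())
print(n_combinations)  # 4 * 4 * 2 * 8 * 9 * 3 = 6912
```

So with `n_iter=500`, the random search covers roughly 7% of this grid (before the `criterion` axis multiplies it further), which is why it finishes far faster than an exhaustive `GridSearchCV`.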
rs = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=42), param_distributions=parameters,
return_train_score=True, n_jobs=-1, verbose=2, cv=10, n_iter=500)
rs.fit(x_train, y_train)
mean = rs.best_score_
std = rs.cv_results_['mean_test_score'].std()
print(f"Mean training score: {rs.cv_results_['mean_train_score'].mean()}")
print(f"Mean validation score: {mean}")
print(f"Validation standard deviation: {std}")
print(f"95% confidence interval: {str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))}")
print(f"Best parameters: {rs.best_params_}")
print(f"Test score: {rs.score(x_test, y_test)}")
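After `fit`, `RandomizedSearchCV` refits the best parameter setting on the whole training set and exposes it as `best_estimator_`, which can then be used directly for prediction. A self-contained sketch with synthetic data and a deliberately tiny grid (variable names here are illustrative, not from the notebook):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions={'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]},
    n_iter=4, cv=3, random_state=42)
search.fit(xtr, ytr)

best = search.best_estimator_   # already refitted on the full training set
preds = best.predict(xte)       # predictions from the tuned model
print(search.best_params_, round(search.score(xte, yte), 3))
```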
The best model is Gradient boosting, which achieves high and similar scores on the training and test data.
With this model, the cross-validated R² score is expected to fall between roughly 0.82 and 1.00 at the 95% confidence level.
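The 0.82-to-1.00 interval above follows from the mean ± 1.96·std normal approximation used throughout the notebook; for illustration, with hypothetical fold statistics of mean 0.91 and std 0.046 (not the exact values from the search):

```python
# Hypothetical fold statistics, for illustration only
mean, std = 0.91, 0.046
lo, hi = mean - 1.96 * std, mean + 1.96 * std
print(round(lo, 2), '<->', round(hi, 2))  # 0.82 <-> 1.0
```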